This analysis walks through a series of visualizations of the insurance data and what each one shows. The bar plot of region reveals a gap in counts: the Southeast region has the highest count. The BMI histogram is roughly symmetric, while the charges histogram is skewed right. The box plot of BMI by region highlights regional differences in the BMI distribution: the Southeast region has the largest IQR and contains outliers. The scatterplot shows a positive relationship between age and charges, with color indicating whether or not a client is a smoker. The pie chart shows that cases with no children or one child are the most common, while cases with more children are less common. The box plot of charges by number of children shows numerous outliers, most often for zero, one, two, and three children, which reflects the wide variation in charges. Overall, the visualizations provide insight into the differences among regions and the relationships among variables, and they give a basis for further analysis.
#1. Read the data file insurance.csv using the read_csv() function in tidyverse.
```{r}
library(readr)
insurance_data <- read_csv("insurance.data.csv")
head(insurance_data)
```
#2. Get a glimpse of the data and indicate the number of observations and the number of variables in the data.
# A tibble: 6 × 7
age sex bmi children smoker region charges
<dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 19 female 27.9 0 yes southwest 16885.
2 18 male 33.8 1 no southeast 1726.
3 28 male 33 3 no southeast 4449.
4 33 male 22.7 0 no northwest 21984.
5 32 male 28.9 0 no northwest 3867.
6 31 female 25.7 0 no southeast 3757.
[1] 1338 7
#3. Create a bar plot of region. Use a few sentences to summarize your finding based on the plot.
Looking at the bar plot of region, the southeast is the largest region, with a count over 350, while the other three each have a count just over 300. The mean count per region therefore falls between 325 and 350.
#4. Create a stack bar plot such that region is on the x axis and each bar shows the distribution of smoker in that region. You should make sure that your y axis shows percents.
#5. Create a histogram of bmi. Discuss the distribution of the histogram.
The BMI histogram shows a unimodal, roughly symmetric distribution: the values are spread about evenly on either side of the center, with no pronounced left or right skew.
#6. Create a histogram of charges. Discuss the distribution of the histogram.
The histogram of charges is skewed right: the highest bar, with a frequency of around 125, sits at the low end of the charges axis, and a long tail extends toward higher charges.
#7. Create a boxplot that shows the distribution of bmi based on the region. Discuss what you find based on the boxplot. (Hint: you need to have x and y variables in mapping)
In the boxplot, the median BMI in each region is a little over 30. Once again the southeast stands out, with the largest IQR of the four regions. Unlike the earlier plots, a boxplot also shows outliers in the data, which appear as dots above certain boxes.
#8. Create a scatterplot that shows the relationship between age (independent variable) and charges (dependent variable). Comment on the scatterplot.
The scatterplot suggests a positive relationship between age and charges: the points drift upward as they move to the right. The shift is not large, so the relationship is likely moderate or even weak.
#9. You should find that it seems “charges” could be classified into several groups. Let’s create a scatterplot that has age as the independent variable (x) and has smoker as another categorical variable (color), and the response variable is charges. Comment on the scatterplot.
Compared with the previous scatterplot, the colors make the data easier to read. The plot still shows a positive relationship, but now it shows that relationship separately for smokers and nonsmokers. The coloring also makes a line of dots visible between the smokers and nonsmokers near the 2000 charges area.
#10. Now, create two data frames by subsetting insurance data as follows:
smoker <- insurance[insurance$smoker == "yes", ]
nonsmoker <- insurance[insurance$smoker == "no", ]
# A tibble: 274 × 7
age sex bmi children smoker region charges
<dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 19 female 27.9 0 yes southwest 16885.
2 62 female 26.3 0 yes southeast 27809.
3 27 male 42.1 0 yes southeast 39612.
4 30 male 35.3 0 yes southwest 36837.
5 34 female 31.9 1 yes northeast 37702.
6 31 male 36.3 2 yes southwest 38711
7 22 male 35.6 0 yes southwest 35586.
8 28 male 36.4 1 yes southwest 51195.
9 35 male 36.7 1 yes northeast 39774.
10 60 male 39.9 0 yes southwest 48173.
# ℹ 264 more rows
# A tibble: 1,064 × 7
age sex bmi children smoker region charges
<dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 18 male 33.8 1 no southeast 1726.
2 28 male 33 3 no southeast 4449.
3 33 male 22.7 0 no northwest 21984.
4 32 male 28.9 0 no northwest 3867.
5 31 female 25.7 0 no southeast 3757.
6 46 female 33.4 1 no southeast 8241.
7 37 female 27.7 3 no northwest 7282.
8 37 male 29.8 2 no northeast 6406.
9 60 female 25.8 0 no northwest 28923.
10 25 male 26.2 0 no northeast 2721.
# ℹ 1,054 more rows
#11. Create a scatterplot that has age as the independent variable (x) and the response variable is charges using the data frame smoker. Then add the smooth line. Comment on the plot. Does it make sense to use the smooth line to summarize the relationship between age of clients and the corresponding charges? Why?
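Not really, though not because of the variable types: age and charges are both numeric variables. The problem is that the points appear to fall into separate groups, as Question 9 suggests, so a single smooth line would run between the groups rather than summarize either one.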
#12. Repeat Question 11 using the data frame nonsmoker.
#13. Based on the finding you have on Questions 11 & 12, propose what you might do next if you want to model charges using other variables in this data.
One variable of interest would be BMI. Adding BMI to the model would show whether it has an influence on charges, which could turn out to be a positive relationship, a negative relationship, or no relationship at all.
#14. Create a pie chart of children. Use a few sentences to summarize your finding based on the plot. (Hint: You need to convert the variable to a categorical variable first)
The pie chart shows that the two largest slices are zero children and one child, which together take up about 60 to 70 percent of the chart. The remaining categories (2, 3, and 4 or more children) are smaller and together make up about 30 to 40 percent.
#15. Create a boxplot that shows the distribution of charges based on the number of children. Discuss what you find based on the boxplot.
The first thing that stands out in the boxplot is the number of outliers. This is especially true for zero children, where numerous outliers appear as dots on the boxplot. The groups with one, two, and three children also have a considerable number of outliers. The groups with two and three children look alike: both have outliers, and their IQRs and ranges are similar. Most of the groups also appear to be skewed right, since the median sits near the bottom of the box.
---
title: "Assignment 7"
author: "Luke Keirn"
date: "2024-03-14"
output:
flexdashboard::flex_dashboard:
source_code: embed
orientation: columns
vertical_layout: fill
---
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(DT)
library(plotly)
insurance_data <- read_csv("insurance.data.csv")
```
Summary
===
This analysis walks through a series of visualizations of the insurance data and what each one shows. The bar plot of region reveals a gap in counts: the Southeast region has the highest count. The BMI histogram is roughly symmetric, while the charges histogram is skewed right. The box plot of BMI by region highlights regional differences in the BMI distribution: the Southeast region has the largest IQR and contains outliers. The scatterplot shows a positive relationship between age and charges, with color indicating whether or not a client is a smoker. The pie chart shows that cases with no children or one child are the most common, while cases with more children are less common. The box plot of charges by number of children shows numerous outliers, most often for zero, one, two, and three children, which reflects the wide variation in charges. Overall, the visualizations provide insight into the differences among regions and the relationships among variables, and they give a basis for further analysis.
Question One
===
column {data-width=450}
---
#1. Read the data file `insurance.csv` using the `read_csv()` function in `tidyverse`.
Column {.tabset data-width=550}
---
```{r}
library(readr)
insurance_data <- read_csv("insurance.data.csv")
head(insurance_data)
```
Question Two
===
column {data-width=450}
---
#2. Get a glimpse of the data and indicate the number of observations and the number of variables in the data.
Column {.tabset data-width=550}
---
````{r glimpse_num_obs_num_var}
head(insurance_data)
dim(insurance_data)
````
Question Three
===
column {data-width=450}
---
#3. Create a bar plot of region. Use a few sentences to summarize your finding based on the plot.
Looking at the bar plot of region, the southeast is the largest region, with a count over 350, while the other three each have a count just over 300. The mean count per region therefore falls between 325 and 350.
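The mean count per region can be checked with a line of arithmetic. The sketch below assumes the 1338 observations reported by `dim()` and the four region labels that appear in the data; the variable names are introduced here for illustration only.

```r
# Total observations, as reported by dim(insurance_data)
n_obs <- 1338
# southwest, southeast, northwest, northeast
n_regions <- 4

# Average number of observations per region
mean_count <- n_obs / n_regions
mean_count

# With the data loaded, the individual counts would come from:
# table(insurance_data$region)
```

The result lies between 325 and 350, which agrees with the reading taken from the bar plot.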
Column {.tabset data-width=550}
---
````{r}
library(ggplot2)
ggplot(data = insurance_data, aes(x = region)) +
  geom_bar() +
  labs(title = "Distribution of Regions",
       x = "Region",
       y = "Count")
````
Question Four
===
column {data-width=450}
---
#4. Create a stack bar plot such that region is on the x axis and each bar shows the distribution of smoker in that region. You should make sure that your y axis shows percents.
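The within-region percentages behind such a plot can be sketched in base R. This is only an illustration under stated assumptions: the `region` and `smoker` vectors hold just the six rows shown by `head()`, not the full data, and `counts` and `row_share` are names introduced here.

```r
# Six rows copied from the head() output; not the full data set
region <- c("southwest", "southeast", "southeast",
            "northwest", "northwest", "southeast")
smoker <- c("yes", "no", "no", "no", "no", "no")

# Cross-tabulate, then divide each row by its own total (margin = 1)
counts <- table(region, smoker)
row_share <- prop.table(counts, margin = 1) * 100

# Each region's shares add up to 100 percent
row_share
rowSums(row_share)
```

Normalizing within each region, rather than over the whole table, is what makes every stacked bar reach the same height.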
Column {.tabset data-width=550}
---
````{r}
library(ggplot2)
library(dplyr)
# Percent of smokers and nonsmokers within each region (0-100 scale)
smoker_percent <- insurance_data %>%
  group_by(region, smoker) %>%
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(percent = count / sum(count) * 100)
ggplot(smoker_percent, aes(x = region, y = percent, fill = smoker)) +
  geom_bar(stat = "identity") +
  labs(title = "Distribution of Smokers by Region",
       x = "Region",
       y = "Percentage") +
  # percent is already on a 0-100 scale, so the labels must not rescale it
  scale_y_continuous(labels = scales::label_percent(scale = 1)) +
  theme_minimal()
````
Question Five
===
column {data-width=450}
---
#5. Create a histogram of bmi. Discuss the distribution of the histogram.
The BMI histogram shows a unimodal, roughly symmetric distribution: the values are spread about evenly on either side of the center, with no pronounced left or right skew.
Column {.tabset data-width=550}
---
````{r}
library(ggplot2)
ggplot(insurance_data, aes(x = bmi)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black") +
  labs(title = "Distribution of BMI",
       x = "BMI",
       y = "Frequency") +
  theme_minimal()
````
Question Six
===
column {data-width=450}
---
#6. Create a histogram of charges. Discuss the distribution of the histogram.
The histogram of charges is skewed right: the highest bar, with a frequency of around 125, sits at the low end of the charges axis, and a long tail extends toward higher charges.
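The direction of skew can also be checked numerically. The sketch below defines a small helper, `sample_skewness()`, which is a name introduced here and not part of any package, and applies it to hypothetical toy values rather than to the insurance data.

```r
# Moment-based sample skewness: positive = right skew, negative = left skew
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical toy values, not taken from the insurance data
symmetric_values <- c(1, 2, 3, 4, 5)
right_tailed_values <- c(1, 1, 2, 2, 3, 20)

sample_skewness(symmetric_values)     # 0 for a perfectly symmetric sample
sample_skewness(right_tailed_values)  # positive: long tail on the right
```

With the data loaded, `sample_skewness(insurance_data$charges)` would give the corresponding check for the histogram above.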
Column {.tabset data-width=550}
---
````{r}
library(ggplot2)
ggplot(insurance_data, aes(x = charges)) +
  geom_histogram(binwidth = 1000, fill = "blue", color = "black") +
  labs(title = "Distribution of Charges",
       x = "Charges",
       y = "Frequency") +
  theme_minimal()
````
Question Seven
===
column {data-width=450}
---
#7. Create a boxplot that shows the distribution of bmi based on the region. Discuss what you find based on the boxplot. (Hint: you need to have x and y variables in mapping)
In the boxplot, the median BMI in each region is a little over 30. Once again the southeast stands out, with the largest IQR of the four regions. Unlike the earlier plots, a boxplot also shows outliers in the data, which appear as dots above certain boxes.
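By default, `geom_boxplot()` draws a point as an outlier when it lies more than 1.5 times the IQR beyond the edge of the box. A minimal sketch of that rule follows; `flag_outliers()` is a helper name introduced here, the first six BMI values are copied from the `head()` output, and the last value is a hypothetical high reading added so that one point gets flagged.

```r
# TRUE for values more than 1.5 * IQR outside the quartiles
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

# Six BMI values from head(), plus one hypothetical high value (53.1)
bmi_values <- c(27.9, 33.8, 33, 22.7, 28.9, 25.7, 53.1)

is_outlier <- flag_outliers(bmi_values)
bmi_values[is_outlier]
```

Only the hypothetical value is flagged in this toy sample; the six observed values all fall inside the fences.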
Column {.tabset data-width=550}
---
````{r}
library(ggplot2)
ggplot(insurance_data, aes(x = region, y = bmi)) +
  geom_boxplot(fill = "blue", color = "black") +
  labs(title = "Distribution of BMI Based on Region",
       x = "Region",
       y = "BMI") +
  theme_minimal()
````
Question Eight
===
column {data-width=450}
---
#8. Create a scatterplot that shows the relationship between age (independent variable) and charges (dependent variable). Comment on the scatterplot.
The scatterplot suggests a positive relationship between age and charges: the points drift upward as they move to the right. The shift is not large, so the relationship is likely moderate or even weak.
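The strength of a linear relationship can be put into a number with the Pearson correlation. The sketch below only shows the mechanics: it uses the rounded values from the six rows printed by `head()`, which is far too few to draw a conclusion about the full data.

```r
# Rounded values copied from the six rows shown by head()
age <- c(19, 18, 28, 33, 32, 31)
charges <- c(16885, 1726, 4449, 21984, 3867, 3757)

# cor() computes the Pearson correlation by default; result is in [-1, 1]
r <- cor(age, charges)
r

# With the data loaded, the full-sample version would be:
# cor(insurance_data$age, insurance_data$charges)
```

Values near 0 indicate a weak linear relationship, and values near 1 or -1 indicate a strong one.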
Column {.tabset data-width=550}
---
````{r}
library(ggplot2)
age_charges_scatter <- ggplot(insurance_data, aes(x = age, y = charges)) +
  geom_point(color = "blue") +
  labs(title = "Relationship Between Age and Charges",
       x = "Age",
       y = "Charges")
print(age_charges_scatter)
````
Question Nine
===
column {data-width=450}
---
#9. You should find that it seems "charges" could be classified into several groups. Let's create a scatterplot that has age as the independent variable (x) and has smoker as another categorical variable (color), and the response variable is charges. Comment on the scatterplot.
Compared with the previous scatterplot, the colors make the data easier to read. The plot still shows a positive relationship, but now it shows that relationship separately for smokers and nonsmokers. The coloring also makes a line of dots visible between the smokers and nonsmokers near the 2000 charges area.
Column {.tabset data-width=550}
---
````{r}
library(ggplot2)
age_charges_smoker_scatter <- ggplot(insurance_data, aes(x = age, y = charges, color = smoker)) +
  geom_point(alpha = 0.6) +
  labs(title = "Relationship Between Age, Charges, and Smoker Status",
       x = "Age",
       y = "Charges",
       color = "Smoker")
print(age_charges_smoker_scatter)
````
Question Ten
===
column {data-width=450}
---
#10. Now, create two data frames by subsetting insurance data as follows:
`smoker <- insurance[insurance$smoker == "yes", ]`
`nonsmoker <- insurance[insurance$smoker == "no", ]`
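A minimal sketch of this kind of row subsetting, run on a toy data frame built from the six rows printed by `head()` (the names `insurance_head`, `smoker_rows`, and `nonsmoker_rows` are introduced here for illustration). The comma after the condition tells R to filter rows and keep every column.

```r
# Toy data frame with three of the seven columns, copied from head()
insurance_head <- data.frame(
  age     = c(19, 18, 28, 33, 32, 31),
  smoker  = c("yes", "no", "no", "no", "no", "no"),
  charges = c(16885, 1726, 4449, 21984, 3867, 3757)
)

# Logical condition before the comma selects rows; empty after it keeps all columns
smoker_rows <- insurance_head[insurance_head$smoker == "yes", ]
nonsmoker_rows <- insurance_head[insurance_head$smoker == "no", ]

nrow(smoker_rows)
nrow(nonsmoker_rows)
```

The two subsets together account for every row of the original data frame.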
Column {.tabset data-width=550}
---
````{r}
smoker <- insurance_data[insurance_data$smoker == "yes", ]
print(smoker)
nonsmoker <- insurance_data[insurance_data$smoker == "no", ]
print(nonsmoker)
````
Question Eleven
===
column {data-width=450}
---
#11. Create a scatterplot that has age as the independent variable (x) and the response variable is charges using the data frame smoker. Then add the smooth line. Comment on the plot. Does it make sense to use the smooth line to summarize the relationship between age of clients and the corresponding charges? Why?
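Not really, though not because of the variable types: age and charges are both numeric (`<dbl>`) variables. The problem is that the points appear to fall into separate groups, as Question 9 suggests, so a single smooth line would run between the groups rather than summarize either one.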
Column {.tabset data-width=550}
---
````{r}
smoker_age_charges_plot <- ggplot(smoker, aes(x = age, y = charges)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Relationship Between Age and Charges for Smokers",
       x = "Age",
       y = "Charges") +
  theme_minimal()
print(smoker_age_charges_plot)
````
Question Twelve
===
column {data-width=450}
---
#12. Repeat Question 11 using the data frame nonsmoker.
Column {.tabset data-width=550}
---
````{r}
nonsmoker_age_charges_plot <- ggplot(nonsmoker, aes(x = age, y = charges)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Relationship Between Age and Charges for Non-Smokers",
       x = "Age",
       y = "Charges") +
  theme_minimal()
print(nonsmoker_age_charges_plot)
````
Question Thirteen
===
column {data-width=450}
---
#13. Based on the finding you have on Questions 11 & 12, propose what you might do next if you want to model charges using other variables in this data.
One variable of interest would be BMI. Adding BMI to the model would show whether it has an influence on charges, which could turn out to be a positive relationship, a negative relationship, or no relationship at all.
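One possible next step is a linear model with BMI added alongside age and smoker status. The sketch below is an assumption about how that could look, not a fitted result for the full data: it uses only ten rows copied (rounded) from the printed smoker and nonsmoker tibbles, and `insurance_sample` and `fit` are names introduced here.

```r
# Ten rows copied (rounded) from the printed smoker and nonsmoker tibbles
insurance_sample <- data.frame(
  age     = c(19, 62, 27, 30, 34, 18, 28, 33, 32, 31),
  bmi     = c(27.9, 26.3, 42.1, 35.3, 31.9, 33.8, 33, 22.7, 28.9, 25.7),
  smoker  = c(rep("yes", 5), rep("no", 5)),
  charges = c(16885, 27809, 39612, 36837, 37702,
              1726, 4449, 21984, 3867, 3757)
)

# Linear model with age, BMI, and smoker status as predictors
fit <- lm(charges ~ age + bmi + smoker, data = insurance_sample)

# The sign of the bmi coefficient indicates the direction of its relationship
coef(fit)
```

On the full data, the same formula with `data = insurance_data` would show whether BMI has a positive, negative, or negligible relationship with charges once age and smoker status are accounted for.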
Column {.tabset data-width=550}
---
Question Fourteen
===
column {data-width=450}
---
#14. Create a pie chart of children. Use a few sentences to summarize your finding based on the plot. (Hint: You need to convert the variable to a categorical variable first)
The pie chart shows that the two largest slices are zero children and one child, which together take up about 60 to 70 percent of the chart. The remaining categories (2, 3, and 4 or more children) are smaller and together make up about 30 to 40 percent.
Column {.tabset data-width=550}
---
````{r}
insurance_data <- insurance_data %>%
  mutate(children_category = case_when(
    children == 0 ~ "0 children",
    children == 1 ~ "1 child",
    children == 2 ~ "2 children",
    children == 3 ~ "3 children",
    children >= 4 ~ "4 or more children"
  ))
children_pie_chart <- ggplot(insurance_data, aes(x = "", fill = children_category)) +
  geom_bar(width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Number of Children",
       fill = "Number of Children") +
  theme_void() +
  theme(legend.position = "right")
print(children_pie_chart)
````
Question Fifteen
===
column {data-width=450}
---
#15. Create a boxplot that shows the distribution of charges based on the number of children. Discuss what you find based on the boxplot.
The first thing that stands out in the boxplot is the number of outliers. This is especially true for zero children, where numerous outliers appear as dots on the boxplot. The groups with one, two, and three children also have a considerable number of outliers. The groups with two and three children look alike: both have outliers, and their IQRs and ranges are similar. Most of the groups also appear to be skewed right, since the median sits near the bottom of the box.
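The link between the position of the median inside the box and the direction of skew can be made concrete with the quartile (Bowley) skewness. The sketch below uses a helper name introduced here, `quartile_skewness()`, and the rounded charges from the ten printed nonsmoker rows, so it illustrates the idea rather than reproducing the boxplot.

```r
# Quartile (Bowley) skewness: positive when the median is closer to Q1
quartile_skewness <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  (q[3] + q[1] - 2 * q[2]) / (q[3] - q[1])
}

# Rounded charges copied from the ten printed nonsmoker rows
charges_sample <- c(1726, 4449, 21984, 3867, 3757,
                    8241, 7282, 6406, 28923, 2721)

quartile_skewness(charges_sample)  # positive here, i.e. right skew
```

A median near the bottom of the box gives a positive value (right skew), and a median near the top gives a negative one (left skew).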
Column {.tabset data-width=550}
---
````{r}
charges_children_boxplot <- ggplot(insurance_data, aes(x = factor(children), y = charges)) +
  geom_boxplot() +
  labs(title = "Distribution of Charges Based on Number of Children",
       x = "Number of Children",
       y = "Charges") +
  theme_minimal()
print(charges_children_boxplot)
````